class: center, middle, inverse, title-slide

# Chicago Taxi Data Analysis
## SCV Group
###
### 27 Oct 2021

---
class: title-slide,middle
background-image: url("pikapika.svg"), url("assets/taxi.jpeg")
background-position: 10% 90%, 100% 50%
background-size: 100px, 100% 100%

<!-- background-color: #0148A4 -->

## .black[Hi! During your Journey,]
## .black[Drive safe &]
## .white[Scroll each page please]
## .black[Enjoy! :-)]

---
## Introduction .orange[By Daryna]
.scroll-output[

## Statement of the goal

In this report we analyze the Taxi Trips in Chicago 2020 dataset. The aim of the project is to determine which variables may influence the revenue of taxi cabs. We are specifically interested in understanding how the cost of a trip varies depending on the community area and which areas generate the highest revenue. We then study how the cost of a trip changes over time.

## Brief summary of the approach

We first present an exploratory data analysis that allows us to familiarize ourselves with the data: the distribution of the continuous variables, the proportions of the categories in the categorical variables, and the relations between them. We present descriptive statistics first with respect to the community areas and then with respect to the taxi companies. We then present interactive maps of Chicago showing which areas are more or less profitable depending on the day and time.

## Introduction to the problem

Similar datasets have been analyzed before, but mostly with respect to trip duration. We think our analysis may be relevant for taxi companies, as it could facilitate the assignment of taxi cabs around the city of Chicago.

.
]

---
## Basic Analysis .orange[By Daryna]
.scroll-output[

## Exploratory Data Analysis

## Continuous variables Fare and Trip.Total

<img src="./assets/Plots_taxis/1.png" width="1867" />

Let's create logarithms for the continuous variables.

<img src="./assets/Plots_taxis/2.png" width="1867" />

The distribution of Trip.Total seems to follow a power-law distribution. Most of the data is concentrated between 0 and 100, while some values are very extreme. We should consider taking the logarithm of Trip.Total to visualise the distribution in relative terms.

## Boxplot

<img src="./assets/Plots_taxis/3.png" width="1867" /><img src="./assets/Plots_taxis/4.png" width="1867" />

## Categorical Variables

## Pickup

Let's now explore the pickup community areas in which taxis are found most often.

```
## Pickup.Community.Area Freq Proportion
## 1 Near North Side 2535 0.27414297
## 2 (The) Loop[11] 2171 0.23477885
## 3 Near West Side 1090 0.11787607
## 4 Morgan Park 761 0.08229696
## 5 Lake View 344 0.03720125
## 6 Lincoln Park 195 0.02108792
## 7 Near South Side 193 0.02087163
## 8 Garfield Ridge 169 0.01827620
## 9 O'Hare 159 0.01719477
## 10 Uptown 148 0.01600519
## 11 Other 1482 0.16026820
```

## Visualisation of Categorical Variables

<img src="./assets/Plots_taxis/5.png" width="1867" />

## DropOff

Let's now explore the dropoff community areas in which taxis are found most often.

<img src="./assets/Plots_taxis/6.png" width="1867" />

## Pie charts

```
## Var1 Freq Proportion
## 11 Other 1516 0.16362655
## 1 Uptown 148 0.01600519
## 2 O'Hare 159 0.01719477
## 3 Garfield Ridge 169 0.01827620
## 4 Near South Side 193 0.02087163
## 5 Lincoln Park 195 0.02108792
## 6 Lake View 344 0.03720125
## 7 Morgan Park 761 0.08229696
## 8 Near West Side 1090 0.11787607
## 9 (The) Loop[11] 2171 0.23477885
## 10 Near North Side 2535 0.27414297
```

<img src="./assets/Plots_taxis/7.png" width="1867" />

## Pie Chart for Dropoff

```
## Dropoff.Community.Area Freq Proportion
## 11
Other 1850 0.20489534
## 1 Edgewater 176 0.01941748
## 2 Near South Side 213 0.02349956
## 3 Uptown 217 0.02394086
## 4 O'Hare 220 0.02427184
## 5 West Town 288 0.03177405
## 6 Lincoln Park 387 0.04269638
## 7 Lake View 391 0.04313769
## 8 Near West Side 1132 0.12488967
## 9 (The) Loop[11] 1768 0.19505737
## 10 Near North Side 2475 0.27305825
```

<img src="./assets/Plots_taxis/8.png" width="1867" />

## Distribution by common areas

## Boxplots representing Trip Total of the most common Areas

<img src="./assets/Plots_taxis/9.png" width="1867" />

We can see that for areas 56 and 76 the revenue appears to be higher.

<img src="./assets/Plots_taxis/10.png" width="1867" />

Even after deleting the two highest outliers, the distribution is very skewed, so it is better to continue on the log scale.

### Distribution of Dropoff Areas

## Boxplots representing Trip Total of the most common Areas

<img src="./assets/Plots_taxis/11.png" width="1867" />

<img src="./assets/Plots_taxis/12.png" width="1867" />

### Histograms

<img src="./assets/Plots_taxis/13.png" width="1867" />

It looks like we have a bimodal distribution.

<img src="./assets/Plots_taxis/14.png" width="1867" />

## Hist for Dropoffs

<img src="./assets/Plots_taxis/15.png" width="1867" />

<img src="./assets/Plots_taxis/16.png" width="1867" />

## Relation between continuous variables

<img src="./assets/Plots_taxis/17.png" width="1867" />

### QQ Plots

When plotting theoretical quantiles against empirical quantiles, we see that neither Trip Total nor the log of Trip Total follows a normal distribution overall.
When we separate by the most common areas, it looks a bit better.

<img src="./assets/Plots_taxis/18.png" width="1867" />
<img src="./assets/Plots_taxis/19.png" width="1867" />
<img src="./assets/Plots_taxis/20.png" width="1867" />
<img src="./assets/Plots_taxis/21.png" width="1867" />

### Time series

<img src="./assets/Plots_taxis/22.png" width="1867" />
<img src="./assets/Plots_taxis/23.png" width="1867" />

## For week days

<img src="./assets/Plots_taxis/24.png" width="1867" />
<img src="./assets/Plots_taxis/25.png" width="1867" />
<img src="./assets/Plots_taxis/26.png" width="1867" />
<img src="./assets/Plots_taxis/27.png" width="1867" />

## For dropoff

## Month

<img src="./assets/Plots_taxis/28.png" width="1867" />
<img src="./assets/Plots_taxis/29.png" width="1867" />

.
]

---
## Further Visualise Top Zones .orange[By Yiren]
.scroll-output[

First of all, we perform a basic analysis of the dataset and look at each variable. The continuous variables are visualised using the `ggpairs` function from the GGally package, a powerful tool that gives a scatterplot for each pair of variables and a histogram for each single variable, as well as the correlation between each pair of variables. For instance, we expect to see a correlation between trip_total and trip_miles. In the scatter plot, we can spot an almost linear trend as well as the outliers. In the correlation cell, we get a correlation coefficient of 0.467, which indicates a moderate positive association.

<img src="./assets/1.png" width="5120" />

To further visualise the variable trip_total, which is of main interest to us, we use `geom_histogram` in ggplot2 and carefully select the binwidth.

<img src="./assets/4.png" width="1867" />

## Boxplots: Revenue vs. Zone

With regard to discrete variables, there are 77 zones in Chicago, too many to plot all of them. However, since we are interested in the zones that generate the most revenue, we plot the top 10 zones.
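The ranking step can be sketched with dplyr. This is a minimal illustration with toy numbers, not our actual processing code; the data frame `taxi` and its values are assumptions, while the column names follow the dataset.

```r
library(dplyr)

# Toy stand-in for the trip records (illustrative values only).
taxi <- data.frame(
  pickup_community_area = c(76, 76, 56, 8),
  trip_total            = c(100, 50, 120, 10)
)

# Rank pickup zones by total revenue and keep the top n (n = 10 on the real data).
top_zones <- taxi %>%
  group_by(pickup_community_area) %>%
  summarise(total_revenue = sum(trip_total, na.rm = TRUE)) %>%
  arrange(desc(total_revenue)) %>%
  slice_head(n = 10)
```

The resulting `top_zones` table is then used to filter the trips that feed the boxplots below.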
To visualise the distribution of the data as well as give a comprehensive comparison among zones, we use boxplots in parallel. From the plots below, it is clear that zone 76 and zone 56 generate the most revenue; they are also the two zones where the average revenue is highest, hence they might be the zones drivers like most. While plotting, we found extreme outliers with trip_total > 200. To better visualise the data, we set the y-axis limits to exclude them.

<img src="./assets/5.png" width="1867" />

Further, we prefer the median over the mean because we noticed that most of the revenue distributions are skewed.

<img src="./assets/6.png" width="5120" />

As for the dropoff area, it is obvious that zone 76 produces the most revenue.

<img src="./assets/7.png" width="1867" />

However, we prefer pickup_community_area over dropoff_community_area to answer our question, since we see the pickup zone as the location where new revenue is generated and the one of interest to the driver.

.
]

---
## Total Revenue vs. Zone Analysis by Maps .orange[By Olav]
.scroll-output[

To visualise and highlight the differences between the areas, we use interactive maps provided by the leaflet library. We used [this](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6) dataset provided by the Chicago Data Portal to visualise the borders between community areas. We attribute a colour to each area, indicating the size of the difference relative to the other areas. Our goal was to get a clear understanding of how the area and the time of day affect revenue, and to communicate these findings as clearly as possible. As a basis for this analysis we plot a map centered around Chicago, using the leaflet library to extract a map of Chicago through Google's map API. Thereafter we extract `pickup_community_area`, `community_name` and `geometry` from the dataset of community borders.
`pickup_community_area` is a discrete numerical variable uniquely identifying each area, and `community_name` is a string variable containing its corresponding name. `geometry` is a MultiPolygon object, essentially a list of coordinates representing the borders of each area. We first look into how total revenue is distributed among the areas. The relevant variables are `pickup_community_area` and `trip_total`. `pickup_community_area` is identical to its corresponding variable in the area dataset. `trip_total` is a continuous numeric variable representing the entire cost of the taxi trip for the passenger. To indicate the total revenue in each area, we summed `trip_total` over each area. We then coloured the areas in shades of blue representing the size of their respective revenues. Since the revenue differences between areas often vary by several orders of magnitude, we made the bins logarithmic. The colouring must therefore be interpreted as an indication of the order of magnitude of the revenue, not the exact revenue.
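The logarithmic binning can be sketched with leaflet's palette functions. The `revenue` vector below is a toy stand-in for the per-area totals; `colorBin` is the actual leaflet function used for binned choropleth colours.

```r
library(leaflet)

# Toy revenue totals per area (illustrative): note the orders-of-magnitude spread.
revenue <- c(1e3, 5e3, 2e4, 3e5, 4e6)

# Bin on log10(revenue): each colour step corresponds to an order of magnitude,
# so the fill indicates magnitude, not the exact revenue.
pal <- colorBin("Blues", domain = log10(revenue), bins = 0:7)

area_colours <- pal(log10(revenue))   # one hex colour per area
```

On the real data the palette would be passed to `addPolygons(fillColor = ~pal(log10(revenue)), ...)` over the sf object of community borders.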
.
]

---
## Total Revenue vs. Zone Analysis by Maps .orange[By Yiren]
.scroll-output[

After looking at the revenue map, we wonder why some places generate more revenue than others. To explore the reason, we introduce the Open Street Map dataset of amenities as well as population data from Wikipedia.

## Population

We consider population one of the main factors affecting the total revenue of each zone. Hence we extract the population, density and area data from [Chicago's Community Wikipedia page](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) and join it into our original taxi dataset using the zone number as the key. We apply a logarithmic transformation to the population to make the map more readable. Comparing the two maps, we spot similarities between the population map and the trip total map. We think the population might be one of the explanatory variables for the total revenue; we later investigate the relationship between the factors by multiple regression.
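The join can be sketched as follows. All values and the data frame names `zone_revenue` and `population_df` are hypothetical placeholders for our per-zone totals and the scraped Wikipedia table; only the keying by zone number reflects the actual procedure.

```r
library(dplyr)

# Toy stand-ins (illustrative values, not the real figures).
zone_revenue <- data.frame(pickup_community_area = c(8, 32, 76),
                           total_revenue         = c(2.5e5, 4.1e5, 9.8e5))
population_df <- data.frame(zone       = c(8, 32, 76),
                            population = c(105000, 42000, 13000))

# Join by zone number, then log-transform population for the map colouring.
zone_stats <- zone_revenue %>%
  left_join(population_df, by = c("pickup_community_area" = "zone")) %>%
  mutate(log_population = log10(population))
```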
## Load Open Street Map Data

Apart from population, we think the location of certain city amenities might be another main factor affecting the total revenue of each zone. Hence we extract amenity information, including theatres, restaurants and hospitals, from Open Street Map data (osmdata) and plot those locations as semitransparent points. Plotting on top of Chicago's toner background, we find that the places with the most clustered orange dots are the most popular areas.

<img src="./assets/2.png" width="1867" />

We would like to reduce redundant information on the map. So we first extract the street data from OSM and manually keep only the roads that allow cars through, excluding paths such as living streets and footways. These roads are plotted .brown[in black], with darker lines indicating bigger streets, highways for instance. Then we need to identify which zone these dots belong to, so we join the shape file that gives the geometry of each zone of Chicago and plot the boundary of each zone .blue[in blue] by calling `geom_sf()`, which allows us to draw different geometric objects. In the end, we again plot the places of amenities as points .orange[in orange]. Now we can clearly identify which zones are the most popular.

<img src="./assets/3.png" width="1867" />

Comparing this graph with the earlier total revenue map, we think the higher revenue in certain zones might be related to the density of certain amenities. The only exception is O'Hare, where few amenities are gathered but which is a hot spot for taxis. It turns out that O'Hare is the location of Chicago O'Hare International Airport, which explains why it attracts so many taxis and produces such a high revenue. As an aside, due to the limits of the OSM dataset, we cannot get all the places where people are likely to take a taxi. However, the current information is enough to tell where the city centre is.
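The amenity extraction can be sketched with osmdata's query interface (`opq`, `add_osm_feature` and `osmdata_sf` are the package's actual functions). Running it requires network access to the Overpass API, so this is a sketch rather than a reproducible chunk.

```r
library(osmdata)
library(magrittr)   # for the %>% pipe

# Query the amenities named above within Chicago's bounding box.
amenities <- opq("Chicago, Illinois") %>%
  add_osm_feature(key = "amenity",
                  value = c("theatre", "restaurant", "hospital")) %>%
  osmdata_sf()

pts <- amenities$osm_points   # point geometries, drawn as semi-transparent dots
```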
## Visualise 24hrs Changes of Total Revenue by Animation

Instead of simply looking at the total revenue per zone, to get more useful output that helps us identify popular areas and potentially helps taxi drivers find the most profitable zone within a specific time frame, we split the day into 24 hours and plot the total revenue per zone, per hour. We plot the polygons of the communities as before. From the timestamp, we extract the hour in which each record falls, group the records by hour, and plot the corresponding total revenue of that hour on our map. We use the .green["Greens"] palette as the background, a continuous colour scale indicating the logarithmic total revenue (log base 10), which is more sensitive to changes in areas with lower total revenue. The yellow bubbles use the original value of the total revenue of that hour and are more sensitive to areas with larger total revenue. Combining the two, we get a better visual perception of how the total revenue changes throughout the 24 hours. The animation is generated by concatenating the 24 static maps, one per hour.

<!-- -->

From the animation above, we spot several zones that are of the highest interest almost throughout the day, which again are mostly in the city centre and at the airport.

.
]

---
## Shiny App: Trip data by pickup/dropoff areas of Chicago .orange[By Thomas]
.scroll-output[

.orange[For a better view of this page, you can go to https://uberlu.shinyapps.io/report1_thomas/]

### by pickup areas of Chicago

In this section we comment on an interactive map of Chicago that displays the _sum_, _mean_ and _max_ of the revenue for the chosen time of day, from 1 A.M. to 12 A.M., for each pickup area of the city.
```r
shinyApp(ui = ui, server = server)
```

.orange[Please click: https://uberlu.shinyapps.io/Pickup/]

For the sum of the revenue per area, we see that most of the time it is the city center (_Near West Side_, _Loop_ and _Near North Side_) and the airports (_O'Hare_ and _Garfield Ridge_) that have the higher values, except during the night, when the values go down for the airports. This could be because there are fewer flights arriving at and departing from the airports at night. For the city center, it is likely because, as in every city, there is more activity in the center.

For the mean of the revenue per area, we can see that it is quite homogeneous along the shore of Lake Michigan. But the airports are among the areas with the highest means during the day, as before.

For the max of the revenue per area, we see that the maximal value attained for each time of day comes from the city center, particularly from _Near North Side_ and _Loop_, with values above $5000 most of the time. This could be because people who take a taxi from the city center have a long journey ahead.

### by drop-off areas of Chicago

This section is similar to the previous one, but instead of looking at pickup areas, we look at drop-off areas and the end time of each trip. As before, we can take a look at each hour of the day from 1 A.M. to 12 A.M.

```r
shinyApp(ui = ui, server = server2)
```

.orange[Please click: https://uberlu.shinyapps.io/Dropoff/]

For the sum, we see much the same picture as for the pickup map, but the highest values are less concentrated in the city center and more scattered around it. We reach the same conclusion for the airports.

For the mean, the values are a lot more homogeneous than for pickups and dispersed across the whole city.

For the maximal values, we see that the highest values appear from the city center to _Lake View_, an area in the north. As before, the airports have high values.
To conclude, this was a brief analysis of the basic data, giving an idea of the situation for both pickups and drop-offs. We see that it would be better to analyse the pickup case in more depth if we wish to answer our questions.

.
]

---
## How can a driver earn more? .orange[By Olav]
.scroll-output[

From the map it is clear that a few areas generate a disproportionate amount of the revenue, with Loop, O'Hare and Near West Side being the most prominent. Does this imply that every driver should strive to spend as much time as possible in these areas? Not necessarily. The total revenue depends on several factors which differ between areas, e.g. population, movement of people and the number of active taxis. Even if an area generates the most money overall, it does not follow that each taxi driver is better off in that area. Just think of the large queues of taxis commonly seen at airports - not exactly time efficient!

If we start by neglecting the variations in average trip time and revenue between areas, the average waiting time in each area is a fairly good representation of how attractive the area is for each individual driver, since drivers only earn money while carrying a passenger. We define the average waiting time as the average time from the moment a taxi drops someone off in an area until the same taxi picks someone else up in the same area. We only consider pick-ups and drop-offs which happen in the same area, because taxis which travel from one area to another between drop-off and pick-up would inflate the average time. This could otherwise lead to the central areas getting higher averages, as taxi drivers are inclined to travel back to the city center after a drop-off. We extract `taxi_id`, `trip_start_timestamp`, `trip_end_timestamp`, `pickup_community_area` and `dropoff_community_area` from our dataset.
`taxi_id` is a unique string identifier for each taxi, and `pickup_community_area` and `dropoff_community_area` are as previously described. `trip_start_timestamp` and `trip_end_timestamp` are date strings in the format **MM/DD/YYYY hh:mm:ss**, indicating the start and end time of the trip. We note that the timestamps have been rounded to the nearest 15 minutes for privacy purposes. Therefore the result will not be 100% accurate, but it gives a good estimate of the true time to pickup. The results are shown below.
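The same-area waiting-time definition can be sketched as follows. The two toy trips are hypothetical; the column names follow the extraction described above.

```r
library(dplyr)

# Toy records for a single taxi (timestamps rounded to 15 min, as in the data).
trips <- data.frame(
  taxi_id                = c("a", "a"),
  trip_start_timestamp   = as.POSIXct(c("2020-03-01 09:30:00", "2020-03-01 10:30:00")),
  trip_end_timestamp     = as.POSIXct(c("2020-03-01 10:00:00", "2020-03-01 11:00:00")),
  pickup_community_area  = c(8, 5),
  dropoff_community_area = c(5, 5)
)

# Waiting time: gap between a drop-off and the SAME taxi's next pickup,
# kept only when both happen in the same community area.
waiting <- trips %>%
  arrange(taxi_id, trip_start_timestamp) %>%
  group_by(taxi_id) %>%
  mutate(next_pickup_area = lead(pickup_community_area),
         next_start       = lead(trip_start_timestamp)) %>%
  ungroup() %>%
  filter(dropoff_community_area == next_pickup_area) %>%
  mutate(wait_min = as.numeric(difftime(next_start, trip_end_timestamp,
                                        units = "mins"))) %>%
  group_by(dropoff_community_area) %>%
  summarise(avg_wait = mean(wait_min))
```

Here the taxi drops off in area 5 at 10:00 and next picks up in area 5 at 10:30, giving a 30-minute wait; pairs that straddle two areas are filtered out, as described above.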
By considering average waiting time we get a fairly different picture. The revenue and activity plots previously pointed towards O'Hare and the central areas around Loop as the most attractive - or at least as generating the most money. We can now see that the taxis in these areas are more prone to waiting, with O'Hare averaging 95 minutes between trips. This is probably an indication of higher competition. However, none of these plots paints the whole picture on its own. Although the waiting time at O'Hare airport is long, a trip from the airport to the city center generates much more revenue than a trip from Loop to West Town. To take this into account we go back to the factors we previously neglected: average trip total and average trip time. By combining these, our goal is to get a measure of _expected_ revenue per hour. We can decompose revenue per hour into:

`$$\text{revenue per hour} = \frac{\text{revenue per trip}}{\text{time per trip}}$$`

We further view the time per trip not only as the length of the trip, but as the sum of trip time and waiting time. We then obtain:

`$$\text{revenue per hour} = \frac{\text{revenue per trip}}{\text{waiting time} + \text{trip time}}$$`

To obtain the expected revenue per hour for each area, we simply calculate the average of the three components by area and combine them. To measure the accuracy of our estimator we used the boot library to generate 1000 random samples from each area and calculated the average hourly rate for each sample. Since the function boot.ci didn't work on the data set, we extracted a 95% confidence interval by sorting the sampled values in each area and taking the 25th and 975th largest values. The obtained result and the code for bootstrapping can be viewed below.
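The manual percentile interval described above can be sketched in base R. The `rates` vector is a toy stand-in for one area's per-trip hourly rates; on the real data the resamples come from the boot library.

```r
# Resample the hourly-rate values of one area 1000 times; the 25th and 975th
# sorted bootstrap means bound an approximate 95% percentile interval.
set.seed(42)
rates <- rgamma(200, shape = 4, rate = 0.1)   # toy hourly rates ($/h)

boot_means <- sort(replicate(1000, mean(sample(rates, replace = TRUE))))
ci_95 <- c(boot_means[25], boot_means[975])
```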
.
]

---
## Under the hood, why is it the case? .orange[By Yiren]
.scroll-output[

## Visualise Factors on the Chicago Map

To get a general idea of what each factor looks like across zones, we map each factor onto the zone polygons.

.pull-left[
<img src="./assets/8.png" width="1867" />
]
.pull-right[
<img src="./assets/9.png" width="1867" />
]

To find out why specific zones generate the most revenue and why some zones have shorter pickup times, we join several other datasets downloaded from the Chicago government website and look for correlations among the factors. When we initially plot the correlations, we can clearly see outliers that greatly affect the correlation coefficients, hence we drop them and plot again. For this plot, .orange[you can click on each cell and see the correlation between the variables]. Our dependent variable `trip_total` is placed in the last column; it has negative correlations with `percent_aged_under_18_or_over_64`, `percent_aged_25_without_high_school_diploma` and `percent_of_housing_crowded`, and positive correlations with `per_capita_income`, `population` and `density`. As for `avg_pickup_time`, in the second-to-last column, it has negative correlations with `percent_households_below_poverty` and `percent_aged_16_unemployed`.
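The outlier-dropping step before recomputing the correlations can be sketched in base R. The data frame `zones_df`, its columns beyond those named in the text, and all values are toy stand-ins for the joined per-zone table.

```r
# Drop the extreme trip_total outliers, then recompute the correlation matrix.
set.seed(1)
zones_df <- data.frame(
  trip_total        = c(rlnorm(20, 10, 1), 1e9),   # one extreme outlier
  population        = rlnorm(21, 10, 0.5),
  per_capita_income = rlnorm(21, 10, 0.5)
)

zones_clean <- zones_df[zones_df$trip_total < quantile(zones_df$trip_total, 0.95), ]
cor_clean   <- cor(zones_clean, use = "pairwise.complete.obs")
```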
## Multiple Regression by Backward Model Selection

First, we perform multiple regression for `trip_total`. To find the most appropriate model, we start with the model containing all possible explanatory variables. We then progressively remove the least informative variable, stopping when all variables in the current model are significant at `\(\alpha = 0.05\)`. We do a backward search using the Akaike Information Criterion (AIC) and get the following model.

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 700722.131 212619.909 3.296 0.002
## population 3.136 1.379 2.273 0.026
## percent_of_housing_crowded -27518.550 6744.633 -4.080 0.000
## percent_aged_16_unemployed 19959.164 4129.041 4.834 0.000
## percent_aged_under_18_or_over_64 -19294.557 5914.440 -3.262 0.002
```

Therefore, we get

`$${\text{trip_total}} = 700722.131 + 3.136 \times \text{population} -27518.550 \times \text{percent_of_housing_crowded} + 19959.164 \times \text{percent_aged_16_unemployed} -19294.557 \times \text{percent_aged_under_18_or_over_64}$$`

Since the p-values for each variable and the intercept are all less than 0.05, we keep all of them in our formula for `trip_total`.

```
## # A tibble: 1 x 3
## r.squared adj.r.squared p.value
## <dbl> <dbl> <dbl>
## 1 0.45 0.42 0
```

We get an adjusted R squared of 0.42, which means 42% of the variance of the dependent variable `trip_total` can be explained by our independent variables; hence it is a relatively good model. Then, we apply the same process to `avg_pickup_time`.

```
## Estimate Std.
Error t value
## (Intercept) 15.213 3.098 4.910
## percent_of_housing_crowded 0.674 0.298 2.262
## percent_aged_25_without_high_school_diploma -0.184 0.101 -1.813
## percent_households_below_poverty -0.165 0.051 -3.199
## percent_aged_under_18_or_over_64 0.130 0.093 1.392
## Pr(>|t|)
## (Intercept) 0.000
## percent_of_housing_crowded 0.027
## percent_aged_25_without_high_school_diploma 0.074
## percent_households_below_poverty 0.002
## percent_aged_under_18_or_over_64 0.169
```

`$${\text{avg_pickup_time}} = 15.213 + 0.674 \times \text{percent_of_housing_crowded} - 0.184 \times \text{percent_aged_25_without_high_school_diploma} - 0.165 \times \text{percent_households_below_poverty} + 0.130 \times \text{percent_aged_under_18_or_over_64}$$`

```
## # A tibble: 1 x 3
## r.squared adj.r.squared p.value
## <dbl> <dbl> <dbl>
## 1 0.21 0.16 0
```

We get a formula for `avg_pickup_time`; however, the adjusted R squared is just 0.16, which indicates that the model is weak and that we need other factors to better explain `avg_pickup_time`. Note also that `percent_aged_25_without_high_school_diploma` and `percent_aged_under_18_or_over_64` are not significant at `\(\alpha = 0.05\)` in this model. Both points suggest a direction for our future work.

.
]